STA6235: Modeling in Regression
Suppose we are faced with count data.
Fortunately, the Poisson distribution is appropriate for count data.
The Poisson regression model is as follows:
\ln\left( y \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
candidatevotes: votes received by this candidate for this particular party
totalvotes: total number of votes cast for this election
year: year in which election was held
region: region that state belongs to, as defined by the US Census Bureau
library(tidyverse)
library(fastDummies)
house <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-11-07/house.csv') %>%
filter(stage == "GEN" &
party %in% c("REPUBLICAN", "DEMOCRAT") &
state != "DISTRICT OF COLUMBIA") %>%
mutate(candidatevotes = candidatevotes + 1,
year10 = year/10,
region = if_else(state %in% c("CONNECTICUT", "MAINE", "MASSACHUSETTS", "NEW HAMPSHIRE", "RHODE ISLAND", "VERMONT", "NEW JERSEY", "NEW YORK", "PENNSYLVANIA"), "Northeast",
if_else(state %in% c("ILLINOIS", "INDIANA", "MICHIGAN", "OHIO", "WISCONSIN", "IOWA", "KANSAS", "MINNESOTA", "MISSOURI", "NEBRASKA", "NORTH DAKOTA", "SOUTH DAKOTA"), "Midwest",
if_else(state %in% c("DELAWARE", "FLORIDA", "GEORGIA", "MARYLAND", "NORTH CAROLINA", "SOUTH CAROLINA", "VIRGINIA","WEST VIRGINIA", "ALABAMA", "KENTUCKY", "MISSISSIPPI", "TENNESSEE", "ARKANSAS", "LOUISIANA", "OKLAHOMA", "TEXAS"), "South", "West")))) %>%
dummy_cols(select_columns = c("region"))house %>% ## BOX PLOT 1 ##
ggplot(aes(y=candidatevotes, x=region)) +
geom_boxplot() +
theme_bw()
house %>% ## BOX PLOT 2 ##
filter(candidatevotes < 450000) %>%
ggplot(aes(y=candidatevotes, x=region)) +
geom_boxplot() +
theme_bw()
house %>% ## HISTOGRAM 1 ##
filter(candidatevotes < 450000) %>%
ggplot(aes(x=candidatevotes)) +
geom_histogram(bins = 45) +
theme_bw()
house %>% ## HISTOGRAM 2 ##
filter(candidatevotes < 450000) %>%
ggplot(aes(x=candidatevotes)) +
geom_histogram(bins = 45) +
theme_bw()glm() function when specifying the Poisson distribution.totalvotes in the model?
Call:
glm(formula = candidatevotes ~ year10 + party + region_Midwest +
region_Northeast + region_West + party:year10 + party:region_Midwest +
party:region_Northeast + party:region_West + totalvotes,
family = "poisson", data = house)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.619e+00 4.553e-03 -575.3 <2e-16 ***
year10 6.886e-02 2.277e-05 3024.3 <2e-16 ***
partyREPUBLICAN -4.398e+00 6.365e-03 -691.0 <2e-16 ***
region_Midwest 1.111e-01 8.373e-05 1327.2 <2e-16 ***
region_Northeast 1.204e-01 8.696e-05 1384.0 <2e-16 ***
region_West 1.122e-01 8.471e-05 1324.6 <2e-16 ***
totalvotes 1.587e-06 1.200e-10 13230.0 <2e-16 ***
year10:partyREPUBLICAN 2.239e-02 3.177e-05 704.8 <2e-16 ***
partyREPUBLICAN:region_Midwest -5.855e-02 1.161e-04 -504.2 <2e-16 ***
partyREPUBLICAN:region_Northeast -3.096e-01 1.273e-04 -2431.8 <2e-16 ***
partyREPUBLICAN:region_West -1.865e-01 1.213e-04 -1536.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 553349377 on 19566 degrees of freedom
Residual deviance: 382263499 on 19556 degrees of freedom
AIC: 382521177
Number of Fisher Scoring iterations: 5
(Intercept) year10
-2.619027e+00 6.885689e-02
partyREPUBLICAN region_Midwest
-4.398204e+00 1.111224e-01
region_Northeast region_West
1.203553e-01 1.122048e-01
totalvotes year10:partyREPUBLICAN
1.587455e-06 2.239381e-02
partyREPUBLICAN:region_Midwest partyREPUBLICAN:region_Northeast
-5.855104e-02 -3.096250e-01
partyREPUBLICAN:region_West
-1.864849e-01
\begin{align*} \ln(\hat{y}) =& -2.62 + 0.07 \text{ year} - 4.40 \text{ repub.} + 0.11 \text{ midwest} + 0.12 \text{ northeast} + 0.11 \text{ west} \\ & + 0.02 (\text{repub. $\times$ year}) \\ & - 0.06 (\text{repub. $\times$ midwest}) - 0.31 (\text{repub. $\times$ northeast}) - 0.19 (\text{repub. $\times$ west}) \end{align*}
\text{IRR}_i = \exp\left\{\hat{\beta}_i\right\}
This is a multiplicative effect, like an odds ratio in logistic regression.
An IRR > 1 indicates an increase in the expected count.
An IRR < 1 indicates a decrease in the expected count.
We also interpret the IRR similar to the odds ratio:
For a 1 [unit of predictor] increase in [predictor name], the expected count of [outcome] is multiplied by \left[ e^{\hat{\beta}_i} \right].
For a 1 [unit of predictor] increase in [predictor name], the expected count of [outcome] are [increased or decreased] by [100(e^{\hat{\beta}_i}-1)% or 100(1-e^{\hat{\beta}_i})%].
Every year, the expected count of votes increases by 10%.
Those in the midwest have an expected number of votes that is 1.05 times the votes for those in the south.
Those in the northeast have an expected number of votes that is 0.83 times the votes for those in the south.
Those in the west have an expected number of votes that is 0.93 times the votes for those in the south.
Every year, the expected count of votes increases by 7%.
Those in the midwest have an expected number of votes that is 1.12 times the votes for those in the south.
Those in the northeast have an expected number of votes that is 1.13 times the votes for those in the south.
Those in the west have an expected number of votes that is 1.12 times the votes for those in the south.
confint() function. 2.5 % 97.5 %
(Intercept) -2.627950e+00 -2.610104e+00
year10 6.881227e-02 6.890152e-02
partyREPUBLICAN -4.410679e+00 -4.385728e+00
region_Midwest 1.109583e-01 1.112865e-01
region_Northeast 1.201849e-01 1.205258e-01
region_West 1.120387e-01 1.123708e-01
totalvotes 1.587220e-06 1.587690e-06
year10:partyREPUBLICAN 2.233154e-02 2.245609e-02
partyREPUBLICAN:region_Midwest -5.877865e-02 -5.832343e-02
partyREPUBLICAN:region_Northeast -3.098746e-01 -3.093755e-01
partyREPUBLICAN:region_West -1.867227e-01 -1.862471e-01
We take the same approach to determining significance as before.
A single term in the model \to summary().
Multiple terms in the model \to full/reduced anova().
Call:
glm(formula = candidatevotes ~ year10 + party + region_Midwest +
region_Northeast + region_West + party:year10 + party:region_Midwest +
party:region_Northeast + party:region_West + totalvotes,
family = "poisson", data = house)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.619e+00 4.553e-03 -575.3 <2e-16 ***
year10 6.886e-02 2.277e-05 3024.3 <2e-16 ***
partyREPUBLICAN -4.398e+00 6.365e-03 -691.0 <2e-16 ***
region_Midwest 1.111e-01 8.373e-05 1327.2 <2e-16 ***
region_Northeast 1.204e-01 8.696e-05 1384.0 <2e-16 ***
region_West 1.122e-01 8.471e-05 1324.6 <2e-16 ***
totalvotes 1.587e-06 1.200e-10 13230.0 <2e-16 ***
year10:partyREPUBLICAN 2.239e-02 3.177e-05 704.8 <2e-16 ***
partyREPUBLICAN:region_Midwest -5.855e-02 1.161e-04 -504.2 <2e-16 ***
partyREPUBLICAN:region_Northeast -3.096e-01 1.273e-04 -2431.8 <2e-16 ***
partyREPUBLICAN:region_West -1.865e-01 1.213e-04 -1536.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 553349377 on 19566 degrees of freedom
Residual deviance: 382263499 on 19556 degrees of freedom
AIC: 382521177
Number of Fisher Scoring iterations: 5
f <- glm(candidatevotes ~ year10 + party + region_Midwest + region_Northeast + region_West + party:year10 + party:region_Midwest + party:region_Northeast + party:region_West + totalvotes, family = "poisson", data = house)
r <- glm(candidatevotes ~ year10 + party + region_Midwest + region_Northeast + region_West + party:year10 + totalvotes, family = "poisson", data = house)
anova(r, f, test = "LRT")Let’s create a couple of data visualizations.
Hold year constant at 1983.
Let total number of votes vary on the x-axis.
Define lines for each region…
# coefficients(m) # get coeff
# create full model before plugging in
#-2.619027 + 0.06885689*year10 -4.398204*repub + 0.1111224*midwest + 0.1203553*northeast + 0.1122048*west + 0.000001587455*totalvotes + 0.02239381*year10*repub - 0.05855104*midwest*repub - 0.3096250*northeast*repub - 0.1864849*west*repub
# year10 = 1983
#-2.619027 + 0.06885689*198.3 -4.398204*repub + 0.1111224*midwest + 0.1203553*northeast + 0.1122048*west + 0.000001587455*totalvotes + 0.02239381*198.3*repub - 0.05855104*midwest*repub - 0.3096250*northeast*repub - 0.1864849*west*repub
# republicans
#-2.619027 + 0.06885689*198.3 -4.398204*1 + 0.1111224*midwest + 0.1203553*northeast + 0.1122048*west + 0.000001587455*totalvotes + 0.02239381*198.3*1 - 0.05855104*midwest*1 - 0.3096250*northeast*1 - 0.1864849*west*1
# democrats
#-2.619027 + 0.06885689*198.3 -4.398204*0 + 0.1111224*midwest + 0.1203553*northeast + 0.1122048*west + 0.000001587455*totalvotes + 0.02239381*198.3*0 - 0.05855104*midwest*0 - 0.3096250*northeast*0 - 0.1864849*west*0house <- house %>%
mutate(r_mw = exp(-2.619027 + 0.06885689*198.3 -4.398204*1 + 0.1111224*1 + 0.1203553*0 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*1 - 0.05855104*1*1 - 0.3096250*0*1 - 0.1864849*0*1),
r_ne = exp(-2.619027 + 0.06885689*198.3 -4.398204*1 + 0.1111224*0 + 0.1203553*1 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*1 - 0.05855104*0*1 - 0.3096250*1*1 - 0.1864849*0*1),
r_w = exp(-2.619027 + 0.06885689*198.3 -4.398204*1 + 0.1111224*0 + 0.1203553*0 + 0.1122048*1 + 0.000001587455*totalvotes + 0.02239381*198.3*1 - 0.05855104*0*1 - 0.3096250*0*1 - 0.1864849*1*1),
r_s = exp(-2.619027 + 0.06885689*198.3 -4.398204*1 + 0.1111224*0 + 0.1203553*0 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*1 - 0.05855104*0*1 - 0.3096250*0*1 - 0.1864849*0*1),
d_mw = exp(-2.619027 + 0.06885689*198.3 -4.398204*0 + 0.1111224*1 + 0.1203553*0 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*0 - 0.05855104*1*0 - 0.3096250*0*0 - 0.1864849*0*0),
d_ne = exp(-2.619027 + 0.06885689*198.3 -4.398204*0 + 0.1111224*0 + 0.1203553*1 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*0 - 0.05855104*0*0 - 0.3096250*1*0 - 0.1864849*0*0),
d_w = exp(-2.619027 + 0.06885689*198.3 -4.398204*0 + 0.1111224*0 + 0.1203553*0 + 0.1122048*1 + 0.000001587455*totalvotes + 0.02239381*198.3*0 - 0.05855104*0*0 - 0.3096250*0*0 - 0.1864849*1*0),
d_s = exp(-2.619027 + 0.06885689*198.3 -4.398204*0 + 0.1111224*0 + 0.1203553*0 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*0 - 0.05855104*0*0 - 0.3096250*0*0 - 0.1864849*0*0))# republicans
house %>%
ggplot(aes(x = totalvotes, y = candidatevotes)) +
geom_point() +
geom_line(aes(y = r_ne), color = "pink") +
geom_line(aes(y = r_s), color = "hotpink") +
geom_line(aes(y = r_mw), color = "purple") +
geom_line(aes(y = r_w), color = "lightblue") +
theme_bw()
# democrats
house %>%
ggplot(aes(x = totalvotes, y = candidatevotes)) +
geom_point() +
geom_line(aes(y = d_ne), color = "pink") +
geom_line(aes(y = d_s), color = "hotpink") +
geom_line(aes(y = d_mw), color = "purple") +
geom_line(aes(y = d_w), color = "lightblue") +
theme_bw()Today we have covered the basics of Poisson regression.
On Thursday, we will learn about the assumptions of Poisson regression and what to do if we break that assumptions.
We will also learn about zero-inflated models.
In today’s assignment, you will remove the potential outlier and see how different the analysis results are.